{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bccd47fc",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/postgres.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db0855d0",
   "metadata": {},
   "source": [
    "# Gel Vector Store\n",
    "[Gel](https://www.geldata.com/) is an open-source PostgreSQL data layer optimized for fast development to production cycle. It comes with a high-level strictly typed graph-like data model, composable hierarchical query language, full SQL support, migrations, Auth and AI modules."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e4f33fc9",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2fc9c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install gel llama-index-vector-stores-gel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "712daea5",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d1c538",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import logging\n",
    "# import sys\n",
    "\n",
    "# Uncomment to see debug logs\n",
    "# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n",
    "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "from llama_index.core import SimpleDirectoryReader, StorageContext\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.vector_stores.gel import GelVectorStore\n",
    "import textwrap\n",
    "import openai"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26c71b6d",
   "metadata": {},
   "source": [
    "### Setup OpenAI\n",
    "The first step is to configure the openai key. It will be used to created embeddings for the documents loaded into the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67b86621",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<your key>\"\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eecf4bd5",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6df9fa89",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
   "metadata": {},
   "source": [
    "### Loading documents\n",
    "Load the documents stored in the `data/paul_graham/` using the SimpleDirectoryReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c154dd4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()\n",
    "print(\"Document ID:\", documents[0].doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bd24f0a",
   "metadata": {},
   "source": [
    "### Create the Database\n",
    "\n",
    "In order to use Gel as a backend for your vectorstore, you're going to need a working Gel instance.\n",
    "Fortunately, it doesn't have to involve Docker containers or anything complicated, unless you want to!\n",
    "\n",
    "To set up a local instance, run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00a7a5e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "! gel project init --non-interactive"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "009a5484",
   "metadata": {},
   "source": [
    "If you are using [Gel Cloud](cloud.geldata.com) (and you should!), add one more argument to that command:\n",
    "\n",
    "```bash\n",
    "gel project init --server-instance <org-name>/<instance-name>\n",
    "```\n",
    "\n",
    "For a comprehensive list of ways to run Gel, take a look at [Running Gel](https://docs.geldata.com/reference/running) section of the reference docs.\n",
    "\n",
    "### Set up the schema\n",
    "\n",
    "[Gel schema](https://docs.geldata.com/reference/datamodel) is an explicit high-level description of your application's data model. \n",
    "Aside from enabling you to define exactly how your data is going to be laid out, it drives Gel's many powerful features such as links, access policies, functions, triggers, constraints, indexes, and more.\n",
    "\n",
    "The LlamaIndex's `GelVectorStore` expects the following layout for the schema:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f1bdf33",
   "metadata": {},
   "outputs": [],
   "source": [
    "schema_content = \"\"\"\n",
    "using extension pgvector;\n",
    "                                    \n",
    "module default {\n",
    "    scalar type EmbeddingVector extending ext::pgvector::vector<1536>;\n",
    "\n",
    "    type Record {\n",
    "        required collection: str;\n",
    "        text: str;\n",
    "        embedding: EmbeddingVector; \n",
    "        external_id: str {\n",
    "            constraint exclusive;\n",
    "        };\n",
    "        metadata: json;\n",
    "\n",
    "        index ext::pgvector::hnsw_cosine(m := 16, ef_construction := 128)\n",
    "            on (.embedding)\n",
    "    } \n",
    "}\n",
    "\"\"\".strip()\n",
    "\n",
    "with open(\"dbschema/default.gel\", \"w\") as f:\n",
    "    f.write(schema_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "958bdd1e",
   "metadata": {},
   "source": [
    "In order to apply schema changes to the database, run the migration using the Gel's [migration tool](https://docs.geldata.com/reference/datamodel/migrations):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa92518f",
   "metadata": {},
   "outputs": [],
   "source": [
    "! gel migration create --non-interactive\n",
    "! gel migrate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3665aad3",
   "metadata": {},
   "source": [
    "From this point onward, `GelVectorStore` can be used as a drop-in replacement for any other vectorstore available in LlamaIndex."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0232fd1",
   "metadata": {},
   "source": [
    "### Create the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8731da62",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = GelVectorStore()\n",
    "\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, show_progress=True\n",
    ")\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ee4473a-094f-4d0a-a825-e1213db07240",
   "metadata": {},
   "source": [
    "### Query the index\n",
    "We can now ask questions using our index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a2bcc07",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What did the author do?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cf55bf7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68cbd239-880e-41a3-98d8-dbb3fab55431",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What happened in the mid 1980s?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdf5287f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e5e8083",
   "metadata": {},
   "source": [
    "### Metadata filters\n",
    "\n",
    "GelVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d0ad3fc",
   "metadata": {},
   "source": [
    "#### Download git commits dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63e90a89",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/git_commits/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fef41f44",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "with open(\"data/git_commits/commit_history.csv\", \"r\") as f:\n",
    "    commits = list(csv.DictReader(f))\n",
    "\n",
    "print(commits[0])\n",
    "print(len(commits))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b0d9f47",
   "metadata": {},
   "source": [
    "#### Add nodes with custom metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3920109b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create TextNode for each of the first 100 commits\n",
    "from llama_index.core.schema import TextNode\n",
    "from datetime import datetime\n",
    "import re\n",
    "\n",
    "nodes = []\n",
    "dates = set()\n",
    "authors = set()\n",
    "for commit in commits[:100]:\n",
    "    author_email = commit[\"author\"].split(\"<\")[1][:-1]\n",
    "    commit_date = datetime.strptime(\n",
    "        commit[\"date\"], \"%a %b %d %H:%M:%S %Y %z\"\n",
    "    ).strftime(\"%Y-%m-%d\")\n",
    "    commit_text = commit[\"change summary\"]\n",
    "    if commit[\"change details\"]:\n",
    "        commit_text += \"\\n\\n\" + commit[\"change details\"]\n",
    "    fixes = re.findall(r\"#(\\d+)\", commit_text, re.IGNORECASE)\n",
    "    nodes.append(\n",
    "        TextNode(\n",
    "            text=commit_text,\n",
    "            metadata={\n",
    "                \"commit_date\": commit_date,\n",
    "                \"author\": author_email,\n",
    "                \"fixes\": fixes,\n",
    "            },\n",
    "        )\n",
    "    )\n",
    "    dates.add(commit_date)\n",
    "    authors.add(author_email)\n",
    "\n",
    "print(nodes[0])\n",
    "print(min(dates), \"to\", max(dates))\n",
    "print(authors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a638f76a",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = GelVectorStore()\n",
    "\n",
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n",
    "index.insert_nodes(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15f7cf45",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(index.as_query_engine().query(\"How did Lakshmi fix the segfault?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ab03ed4",
   "metadata": {},
   "source": [
    "#### Apply metadata filters\n",
    "\n",
    "Now we can filter by commit author or by date when retrieving nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa6212e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.vector_stores.types import (\n",
    "    MetadataFilter,\n",
    "    MetadataFilters,\n",
    ")\n",
    "\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n",
    "        MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n",
    "    ],\n",
    "    condition=\"or\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67c19ec6",
   "metadata": {},
   "outputs": [],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-25\", operator=\"<=\"),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f6e9cdf",
   "metadata": {},
   "source": [
    "#### Apply nested filters\n",
    "\n",
    "In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94f20be7",
   "metadata": {},
   "outputs": [],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilters(\n",
    "            filters=[\n",
    "                MetadataFilter(\n",
    "                    key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"\n",
    "                ),\n",
    "                MetadataFilter(\n",
    "                    key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"\n",
    "                ),\n",
    "            ],\n",
    "            condition=\"and\",\n",
    "        ),\n",
    "        MetadataFilters(\n",
    "            filters=[\n",
    "                MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n",
    "                MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n",
    "            ],\n",
    "            condition=\"or\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "737692ce",
   "metadata": {},
   "source": [
    "The above can be simplified by using the IN operator. `GelVectorStore` supports `in`, `nin`, and `contains` for comparing an element with a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85faf8b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n",
    "        MetadataFilter(\n",
    "            key=\"author\",\n",
    "            value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n",
    "            operator=\"in\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ab9c333",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same thing, with NOT IN\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n",
    "        MetadataFilter(\n",
    "            key=\"author\",\n",
    "            value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n",
    "            operator=\"nin\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a46764cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# CONTAINS\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"fixes\", value=\"5680\", operator=\"contains\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"How did these commits fix the issue?\")\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
